Skip to contents

Coevolutionary Analysis

At this point, we’ve walked through the steps to take a set of sequences and obtain a set of COGs with phylogenetic reconstructions for each COG. We’re now ready to look for signals of coevolution, which imply functional associations between COGs. These methods are implemented via the ProtWeaver class in SynExtend, which includes many commonly used methods for detecting coevolutionary patterns.

While the previous steps have only utilized a small subsample of the data, we’re now finally going to work with the complete dataset. This dataset is comprised of the 91 Micrococcus genomes available with assemblies at the Scaffold, Chromosome, or Complete level (link to query). Note that more genomes may become available after this conference; these are all that were available at the time.

We ran the complete pipeline of identifying and annotating genes with DECIPHER, finding COGs with SynExtend, and then creating gene trees for each COG using DECIPHER. The complete data consist of 3,407 distinct COGs. All of this analysis is performed entirely within SynExtend and DECIPHER; no external packages or data are required aside from the input genomes.

We now use the new ProtWeaver class to try to find COGs that show evidence of correlated evolutionary selective pressures, also referred to as ‘coevolutionary signal’.

library(DECIPHER)
library(SynExtend)
COGsFile <- system.file('extdata', 'CoevNetworks', 
                        'MicrococcusCOGs.RData',
                        package='CompGenomicsBioc2022')
AnnotationsFile <- system.file('extdata', 'CoevNetworks', 
                               'MicrococcusAnnotations.RData',
                        package='CompGenomicsBioc2022')
TreesFile <- system.file('extdata', 'CoevNetworks', 
                         'MicrococcusGeneTrees.RData',
                        package='CompGenomicsBioc2022')

# All of these files are formatted the same way, as lists with each entry
# corresponding to a COG. COGs entries are identifiers for each gene, CogsAnnot 
# entries are annotations, and CogTrees entries are gene trees
load(COGsFile)          # Loads 'COGs'
load(AnnotationsFile)   # Loads 'CogsAnnot'
load(TreesFile)         # Loads 'CogTrees'
# Let's look at one of the COGs
COGs[[7]]
##  [1] "1_1_919"     "12_44_1601"  "13_5_415"    "2_1_10"      "14_1_274"   
##  [6] "15_1_14"     "16_132_1767" "17_107_1177" "18_1_1825"   "19_1_1783"  
## [11] "20_1_958"    "21_223_1464" "22_44_1726"  "23_13_736"   "3_1_130"    
## [16] "24_11_767"   "25_1_915"    "26_8_258"    "27_53_1567"  "28_10_1132" 
## [21] "29_53_1900"  "30_1_72"     "31_7_808"    "32_7_458"    "33_183_1337"
## [26] "34_130_2076" "35_197_1194" "36_128_792"  "37_22_1142"  "38_81_1984" 
## [31] "4_1_9"       "39_1_1608"   "40_1_416"    "41_1_925"    "42_1_1032"  
## [36] "43_1_707"    "44_1_704"    "45_13_755"   "46_63_1530"  "47_1_970"   
## [41] "48_1_690"    "49_1_779"    "50_1_267"    "51_20_1253"  "52_1_971"   
## [46] "53_105_395"  "54_109_577"  "55_33_1488"  "56_11_533"   "57_161_2179"
## [51] "58_111_515"  "59_55_1658"  "5_1_9"       "60_50_1995"  "61_1_78"    
## [56] "62_8_204"    "63_155_1042" "64_47_712"   "65_96_2144"  "66_11_774"  
## [61] "67_66_1831"  "68_22_82"    "69_22_1206"  "70_34_781"   "71_96_2235" 
## [66] "72_1_1811"   "73_11_821"   "74_1_221"    "75_15_1093"  "76_1_2110"  
## [71] "77_1_696"    "78_1_1963"   "79_12_872"   "80_12_855"   "81_48_2136" 
## [76] "82_41_671"   "83_20_1289"  "84_185_549"  "85_42_1351"  "86_8_1238"  
## [81] "88_7_611"    "89_126_1611" "90_1_980"    "6_21_1244"   "7_40_2015"  
## [86] "8_1_348"     "9_5_575"     "10_1_382"    "11_1_10"     "11_12_1037" 
## [91] "91_1_985"
CogsAnnot[[7]]
##   A test set of class 'Taxa' with length 91
##      confidence name                 taxon
##  [1]       100% 50_1_267             Root; 09120 Genetic Information Processi...
##  [2]       100% 51_20_1253           Root; 09120 Genetic Information Processi...
##  [3]       100% 52_1_971             Root; 09120 Genetic Information Processi...
##  [4]        71% 53_105_395           Root; 09120 Genetic Information Processi...
##  [5]       100% 54_109_577           Root; 09120 Genetic Information Processi...
##  ...        ... ...                  ...
## [87]       100% 45_13_755            Root; 09120 Genetic Information Processi...
## [88]       100% 46_63_1530           Root; 09120 Genetic Information Processi...
## [89]       100% 47_1_970             Root; 09120 Genetic Information Processi...
## [90]       100% 48_1_690             Root; 09120 Genetic Information Processi...
## [91]       100% 49_1_779             Root; 09120 Genetic Information Processi...
CogsAnnot[[7]][[1]]
## $taxon
## [1] "Root"                                                             
## [2] "09120 Genetic Information Processing"                             
## [3] "09122 Translation"                                                
## [4] "03010 Ribosome [PATH:ko03010]"                                    
## [5] "K02899  RP-L27, MRPL27, rpmA, large subunit ribosomal protein L27"
## 
## $confidence
## [1] 100 100 100 100 100
CogTrees[[7]]
## 'dendrogram' with 2 branches and 91 members total, at height 0.3813559
# Note that tree labels correspond to gene identifiers
setequal(COGs[[7]], labels(CogTrees[[7]]))
## [1] TRUE

There is a ton of data here, and we unfortunately don’t have time to look at all of it. To demonstrate some of the things we can do with ProtWeaver, we’re going to look at subset of the data that is easier to investigate.

We’ll subset the COGs to ones that meet the following characteristics:

  • COGs with no paralogs
  • COGs not part of the core genome
  • COGs present in at least 5 genomes (not singletons or super rare ones)
  • COGs with at least one high confidence annotation
  • COGs that imply a coding region

These are relatively arbitrary requirements, but they subset the data to an example that runs quickly and is easily understandable. This essentially takes a group of COGs that are not essential to the organisms but tend to appear a lot, that have been characterized individually before.

## Subsetting COGs

# Cutoff values (91 genomes total!)
CoreCutoff <- 88
UnderCutoff <- 5

# Get assembly identifiers for each COG
truncCOGs <- lapply(COGs, \(x) sort(as.integer(gsub('([0-9]*)_.*', '\\1', x))))

# Find COGs without paralogs
noParas <- sapply(truncCOGs, \(x) length(x) == length(unique(x)))

# Get non-core genome
notCoreCOGs <- sapply(truncCOGs, \(x) length(unique(x)) < CoreCutoff)

# Get genes in more than 5 organisms
notSingles <- sapply(truncCOGs, \(x) length(unique(x)) > UnderCutoff)

# Make sure COGs are coding elements
codingCOGs <- sapply(CogsAnnot, \(x) is(x, 'Taxa'))

# At least one high confidence annotation
highConf <- sapply(CogsAnnot, \(x) 
                   if(is(x, 'Taxa')) 
                     max(sapply(x, \(y) 
                                y$confidence[length(y$confidence)])) > 50
                   else FALSE)

# Subset for the workshop
WorkshopSubset <- noParas & notCoreCOGs & notSingles & codingCOGs & highConf

# Subset our data
WCogs <- COGs[WorkshopSubset]
WAnnots <- CogsAnnot[WorkshopSubset]
WTrees <- CogTrees[WorkshopSubset]

Let’s also put together a list of consensus annotations for each COG.

consAnnots <- vector('character', length=length(CogsAnnot))
for ( i in seq_along(CogsAnnot) ){
  taxaentry <- CogsAnnot[[i]]
  if (!is(taxaentry, 'Taxa'))
    consAnnots[[i]] <- 'NONCODING'
  else {
    annots <- sapply(taxaentry, \(y) y$taxon[length(y$taxon)])
    annots <- annots[annots!='unclassified_Root']
    if (length(annots) == 0)
      consAnnots[[i]] <- 'PNACT'
    else
      consAnnots[[i]] <- names(sort(table(annots), decreasing=T))[1]
  }
}

# Remove this when we get the dev branch of SynExtend
consAnnots <- consAnnots[WorkshopSubset]

Now we can make our ProtWeaver object. ProtWeaver has multiple input options, either a list formatted like COGs (list with gene identifiers) or a list like CogTrees (list with gene trees).

pw <- ProtWeaver(WTrees)

The ProtWeaver constuctor automatically detects the type of data you have and adjusts available predictors accordingly. While it functions best with a list of dendrograms for each COG, it can also run with simple presence/absence patterns. See the documentation file for ProtWeaver for more information on this functionality.

We’re now ready to make predictions. Predicting functional associations is done with the predict.ProtWeaver S3 method. Let’s examine possible functional associations between the COGs we have.

preds <- predict(pw)
print(preds)
## a ProtWeb object.
##  Method used: Ensemble 
##  Number of genes: 88 
##  Number of predictions: 3916

Viewing our results

Notice that preds is a ProtWeb object. This is just a simple S3 class with a pretty print method wrapping a matrix of pairwise association scores. We can get the raw data with GetProtWebData():

# Subset so the output is actually readable
GetProtWebData(preds)[1:7, 1:7]
##           1         2         3         4         5         6         7
## 1 1.0000000 0.2991649 0.2026004 0.2609102 0.2583244 0.2565627 0.2451448
## 2 0.2991649 1.0000000 0.1677659 0.3111661 0.2310440 0.2988038 0.2406802
## 3 0.2026004 0.1677659 1.0000000 0.2175542 0.2198164 0.1922557 0.2159145
## 4 0.2609102 0.3111661 0.2175542 1.0000000 0.2177652 0.2634777 0.2348395
## 5 0.2583244 0.2310440 0.2198164 0.2177652 1.0000000 0.3015111 0.2040384
## 6 0.2565627 0.2988038 0.1922557 0.2634777 0.3015111 1.0000000 0.1552947
## 7 0.2451448 0.2406802 0.2159145 0.2348395 0.2040384 0.1552947 1.0000000

The ProtWeb class will be updated next release cycle to include more methods, including a custom plotting function. The current plot.ProtWeb S3 method implements a force-directed embedding of the pairwise scores, but it’s a big work-in-progress. Stay tuned for the next release cycle for more functionality regarding ProtWeb.

In the meantime, we can use the igraph package to find clusters of coevolving COGs.

set.seed(123)

adjMatrix <- GetProtWebData(preds)
g <- graph_from_adjacency_matrix(adjMatrix, weighted=TRUE,
                                 mode='undirected', diag=FALSE)

clusters <- cluster_fast_greedy(g)

# Getting the clusters & identifying COGs by consensus annotation
clusterLabels <- vector('list', length(clusters))
for ( i in seq_along(clusterLabels) ){
  cluster <- communities(clusters)[[i]]
  labs <- consAnnots[as.integer(cluster)]
  clusterLabels[[i]] <- labs[order(sapply(labs, \(x) strsplit(x, ' ')[[1]][3]))]
}
clusterLabels
## [[1]]
## [1] "K07241  hoxN, nixA, nickel/cobalt transporter (NiCoT) family protein"
## [2] "K01430  ureA, urease subunit gamma [EC:3.5.1.5]"                     
## [3] "K01429  ureB, urease subunit beta [EC:3.5.1.5]"                      
## [4] "K01428  ureC, urease subunit alpha [EC:3.5.1.5]"                     
## [5] "K03190  ureD, ureH, urease accessory protein"                        
## [6] "K03188  ureF, urease accessory protein"                              
## [7] "K03189  ureG, urease accessory protein"                              
## 
## [[2]]
##  [1] "K01992  ABC-2.P, ABC-2 type transport system permease protein"                                                      
##  [2] "K02012  afuA, fbpA, iron(III) transport system substrate-binding protein"                                           
##  [3] "K02011  afuB, fbpB, iron(III) transport system permease protein"                                                    
##  [4] "K23777  cecR, TetR/AcrR family transcriptional regulator, regulator of cefoperazone and chloramphenicol sensitivity"
##  [5] "K19132  csb2, CRISPR-associated protein Csb2"                                                                       
##  [6] "K19133  csb3, CRISPR-associated protein Csb3"                                                                       
##  [7] "K00425  cydA, cytochrome bd ubiquinol oxidase subunit I [EC:7.1.1.7]"                                               
##  [8] "K00426  cydB, cytochrome bd ubiquinol oxidase subunit II [EC:7.1.1.7]"                                              
##  [9] "K01714  dapA, 4-hydroxy-tetrahydrodipicolinate synthase [EC:4.3.3.7]"                                               
## [10] "K01424  E3.5.1.1, ansA, ansB, L-asparaginase [EC:3.5.1.1]"                                                          
## [11] "K00059  fabG, OAR1, 3-oxoacyl-[acyl-carrier protein] reductase [EC:1.1.1.100]"                                      
## [12] "K21562  flp, CRP/FNR family transcriptional regulator, anaerobic regulatory protein"                                
## [13] "K00865  glxK, garK, glycerate 2-kinase [EC:2.7.1.165]"                                                              
## [14] "K00529  hcaD, 3-phenylpropionate/trans-cinnamate dioxygenase ferredoxin reductase component [EC:1.18.1.3]"          
## [15] "K01154  hsdS, type I restriction enzyme, S subunit [EC:3.1.21.3]"                                                   
## [16] "K13640  hspR, MerR family transcriptional regulator, heat shock protein HspR"                                       
## [17] "K14059  int, integrase"                                                                                             
## [18] "K02351  K02351, putative membrane protein"                                                                          
## [19] "K07112  K07112, uncharacterized protein"                                                                            
## [20] "K07118  K07118, uncharacterized protein"                                                                            
## [21] "K07454  K07454, putative restriction endonuclease"                                                                  
## [22] "K09167  K09167, uncharacterized protein"                                                                            
## [23] "K00639  kbl, GCAT, glycine C-acetyltransferase [EC:2.3.1.29]"                                                       
## [24] "K01338  lon, ATP-dependent Lon protease [EC:3.4.21.53]"                                                             
## [25] "K00661  maa, maltose O-acetyltransferase [EC:2.3.1.79]"                                                             
## [26] "K11603  mntA, manganese transport system ATP-binding protein [EC:7.2.2.5]"                                          
## [27] "K11602  mntB, manganese transport system permease protein"                                                          
## [28] "K11601  mntC, manganese transport system substrate-binding protein"                                                 
## [29] "K18917  mrx1, mycoredoxin [EC:1.20.4.3]"                                                                            
## [30] "K05846  opuBD, osmoprotectant transport system permease protein"                                                    
## [31] "K01912  paaK, phenylacetate-CoA ligase [EC:6.2.1.30]"                                                               
## [32] "K03497  parB, spo0J, ParB family transcriptional regulator, chromosome partitioning protein"                        
## [33] "K00448  pcaG, protocatechuate 3,4-dioxygenase, alpha subunit [EC:1.13.11.3]"                                        
## [34] "K15894  pseB, UDP-N-acetylglucosamine 4,6-dehydratase [EC:4.2.1.115]"                                               
## [35] "K03655  recG, ATP-dependent DNA helicase RecG [EC:3.6.4.12]"                                                        
## [36] "K01156  res, type III restriction enzyme [EC:3.1.21.5]"                                                             
## [37] "K06400  spoIVCA, site-specific DNA recombinase"                                                                     
## [38] "K03111  ssb, single-strand DNA-binding protein"                                                                     
## [39] "K03455  TC.KEF, monovalent cation:H+ antiporter-2, CPA2 family"                                                     
## [40] "K03311  TC.LIVCS, branched-chain amino acid:cation transporter, LIVCS family"                                       
## [41] "K03306  TC.PIT, inorganic phosphate transporter, PiT family"                                                        
## [42] "K07344  trbL, type IV secretion system protein TrbL"                                                                
## [43] "K00384  trxB, TRR, thioredoxin reductase (NADPH) [EC:1.8.1.9]"                                                      
## [44] "K03672  trxC, thioredoxin 2 [EC:1.8.1.8]"                                                                           
## [45] "K16710  wcaK, amsJ, colanic acid/amylovoran biosynthesis protein"                                                   
## [46] "K07459  ybjD, putative ATP-dependent endonuclease of the OLD family"                                                
## 
## [[3]]
##  [1] "K02050  ABC.SN.P, NitT/TauT family transport system permease protein"                             
##  [2] "K06320  cgeB, spore maturation protein CgeB"                                                      
##  [3] "K00956  cysN, sulfate adenylyltransferase subunit 1 [EC:2.7.7.4]"                                 
##  [4] "K04047  dps, starvation-inducible DNA-binding protein"                                            
##  [5] "K00301  E1.5.3.1, sarcosine oxidase [EC:1.5.3.1]"                                                 
##  [6] "K00221  E4.99.1.2, alkylmercury lyase [EC:4.99.1.2]"                                              
##  [7] "K03297  emrE, qac, mmr, smr, small multidrug resistance pump"                                     
##  [8] "K01500  fchA, methenyltetrahydrofolate cyclohydrolase [EC:3.5.4.9]"                               
##  [9] "K02217  ftnA, ftn, ferritin [EC:1.16.3.2]"                                                        
## [10] "K02440  GLPF, glycerol uptake facilitator protein"                                                
## [11] "K06975  K06975, uncharacterized protein"                                                          
## [12] "K07095  K07095, uncharacterized protein"                                                          
## [13] "K07729  K07729, putative transcriptional regulator"                                               
## [14] "K03719  lrp, Lrp/AsnC family transcriptional regulator, leucine-responsive regulatory protein"    
## [15] "K23518  MACROD, ymdB, O-acetyl-ADP-ribose deacetylase [EC:3.1.1.106]"                             
## [16] "K00549  metE, 5-methyltetrahydropteroyltriglutamate--homocysteine methyltransferase [EC:2.1.1.14]"
## [17] "K03517  nadA, quinolinate synthase [EC:2.5.1.72]"                                                 
## [18] "K00767  nadC, QPRT, nicotinate-nucleotide pyrophosphorylase (carboxylating) [EC:2.4.2.19]"        
## [19] "K05846  opuBD, osmoprotectant transport system permease protein"                                  
## [20] "K15866  paaG, 2-(1,2-epoxy-1,2-dihydrophenyl)acetyl-CoA isomerase [EC:5.3.3.18]"                  
## [21] "K00074  paaH, hbd, fadB, mmgB, 3-hydroxybutyryl-CoA dehydrogenase [EC:1.1.1.157]"                 
## [22] "K00156  poxB, pyruvate dehydrogenase (quinone) [EC:1.2.5.1]"                                      
## [23] "K03343  puo, putrescine oxidase [EC:1.4.3.10]"                                                    
## [24] "K02914  RP-L34, MRPL34, rpmH, large subunit ribosomal protein L34"                                
## [25] "K12510  tadB, tight adherence protein B"                                                          
## [26] "K12511  tadC, tight adherence protein C"                                                          
## [27] "K01975  thpR, RNA 2',3'-cyclic 3'-phosphodiesterase [EC:3.1.4.58]"                                
## [28] "K07260  vanY, zinc D-Ala-D-Ala carboxypeptidase [EC:3.4.17.14]"                                   
## [29] "K22736  VIT, vacuolar iron transporter family protein"                                            
## [30] "K16870  wbbL, N-acetylglucosaminyl-diphospho-decaprenol L-rhamnosyltransferase [EC:2.4.1.289]"    
## [31] "K19159  yefM, antitoxin YefM"                                                                     
## [32] "K19158  yoeB, toxin YoeB [EC:3.1.-.-]"                                                            
## 
## [[4]]
## [1] "K15770  cycB, ganO, arabinogalactan oligomer / maltooligosaccharide transport system substrate-binding protein"
## [2] "K05499  cytR, LacI family transcriptional regulator, repressor for deo operon, udp, cdd, tsx, nupC, and nupG"  
## [3] "K15772  ganQ, arabinogalactan oligomer / maltooligosaccharide transport system permease protein"

Methods Implemented in ProtWeaver

By default, predict.ProtWeaver makes an ensemble prediction using as many individual models as it can run with the data provided. However, users are free to use any of the individual models without the ensemble predictor. The methods implemented are the following:

# PHYLOGENETIC PROFILING METHODS:
  ## P/A = Presence/Absence Profiles
  ## Jaccard distance of P/A
Jaccard <- predict(pw, method='Jaccard') 

  ## Hamming distance of P/A
Hamming <- predict(pw, method='Hamming') 

  ## MutualInformation of P/A
MutualInf <- predict(pw, method='MutualInformation')

  ## Direct Coupling Analysis of P/A
ProfDCA <- predict(pw, method='ProfDCA') 

  ## Correlation of gain/loss events on phylogeny, requires Species Tree
Behdenna <- predict(pw, method='Behdenna', mySpeciesTree=exSpeciesTree)

# CO-LOCALIZATION METHODS:
Colocalization <- predict(pw, method='Coloc') # Co-localization analysis

# DISTANCE MATRIX METHDOS:
MirrorTree <- predict(pw, method='MirrorTree')
ContextTree <- predict(pw, method='ContextTree')

# Residue Methods: (ONLY AVAILABLE IN DEV VERSION)
#   ## MutualInf of residues
ResidueMI <- predict(pw, method='ResidueMI')